This exercise uses the data set BLDPRES.SAV. We focus on the relation between age and systolic blood pressure. We will first look at the women, and then at the men.
###################################################
# part a #
###################################################
library(foreign) # load the package
df <- read.spss(file = ..., to.data.frame = TRUE) # import your own datafile
###################################################
# part b #
###################################################
df.w <- df[df$sex == "female", ]
fit <- lm(... ~ ..., data = df.w)
plot(y = ..., x = ...)
abline(..., col = "red")
###################################################
# part c #
###################################################
summary(...)
###################################################
# part d #
###################################################
library(ggplot2)
ggplot(df.w, aes(y = ..., x = ...)) + geom_smooth(method = "lm")
###################################################
# part e #
###################################################
plot(...)
###################################################
# part a #
###################################################
library(foreign) # load the package
df <- read.spss(file = "./data/BLDPRES.sav", to.data.frame = TRUE) # import your own datafile
###################################################
# part b #
###################################################
df.w <- df[df$sex == "female", ]
fit.w <- lm(sys ~ age, data = df.w)
plot(y = df.w$sys, x = df.w$age)
abline(fit.w, col = "red")
###################################################
# part c #
###################################################
summary(fit.w)
###################################################
# part d #
###################################################
library(ggplot2)
ggplot(df.w, aes(y = sys, x = age)) + geom_smooth(method = "lm")
###################################################
# part e #
###################################################
plot(fit.w)
##
## Call:
## lm(formula = sys ~ age, data = df.w)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.639 -9.642 -1.144 9.356 51.864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.64871 3.23658 32.642 < 2e-16 ***
## age 0.49983 0.07688 6.501 7.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.23 on 181 degrees of freedom
## Multiple R-squared: 0.1893, Adjusted R-squared: 0.1848
## F-statistic: 42.26 on 1 and 181 DF, p-value: 7.52e-10
The regression is significant (p<0.001). The regression equation is: sys = 105.6 + 0.50 x age. Per year increase in age, systolic blood pressure increases by 0.50 mmHg. The 95% CI is 0.50 ± 1.96 x 0.077: 0.35 to 0.65. The R2 should be interpreted as the percentage of variability in systolic blood pressure explained by the model (intercept and age), which is 18.9% in this case. The first diagnostic plot shows the residuals plotted against the fitted values. The residuals are distributed around zero, with equal spread for each fitted value (homoscedasticity). The second plot (Q-Q plot) is better suited to assess the normality assumption, which holds for this regression. The other two plots show the leverage of each individual patient on the fitted model. There are no clear outliers.
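The hand-computed interval above can also be obtained directly in R; a minimal sketch, assuming the fitted object fit.w from part b:

```r
confint(fit.w, level = 0.95) # 95% CIs for the intercept and the age slope
```

The row for age should reproduce approximately 0.35 to 0.65.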
###################################################
# part b #
###################################################
plot(...)
abline(fit.w, col = "red")
abline(fit.m, col = "green")
legend("topleft", legend = c("...", "..."), col = c("...", "..."), lty = 1)
###################################################
# part a #
###################################################
df.m <- df[df$sex == "male", ]
fit.m <- lm(sys ~ age, data = df.m)
plot(y = df.m$sys, x = df.m$age)
abline(fit.m, col = "red")
summary(fit.m)
plot(fit.m)
###################################################
# part b #
###################################################
plot(y = df$sys, x = df$age)
abline(fit.w, col = "red")
abline(fit.m, col = "green")
legend("topleft", legend = c("Female", "Male"), col = c("red", "green"), lty = 1)
##
## Call:
## lm(formula = sys ~ age, data = df.m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.453 -10.832 -1.377 8.435 49.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.56903 3.93895 29.340 <2e-16 ***
## age 0.24436 0.09444 2.587 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.33 on 143 degrees of freedom
## Multiple R-squared: 0.04472, Adjusted R-squared: 0.03804
## F-statistic: 6.695 on 1 and 143 DF, p-value: 0.01067
It is observed that the slope for men (0.24) is lower than that for women (0.50). For men as well, the normality and homoscedasticity assumptions are fulfilled. In the scatter plot for both sexes, it is observed that the two regression lines cross.
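The difference in slopes between the sexes can also be tested formally with an interaction term; a sketch (not part of the exercise), assuming df contains sys, age and sex:

```r
fit.int.sex <- lm(sys ~ age * sex, data = df) # separate slopes per sex
summary(fit.int.sex)$coefficients             # the age:sex row tests the slope difference
```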
This exercise concerns the data set BLDPRES.SAV. We want to use multiple regression in R to study the joint relationship of age, body weight and pulse rate with diastolic blood pressure.
###################################################
# part a #
###################################################
library(foreign)
read.spss(..., to.data.frame = TRUE)
###################################################
# part b and c #
###################################################
fit <- lm(formula = ..., data = df) # do a linear regression
plot(...) # plot the scatterplot
abline(fit, col = "red") # add the regression line over the plot
###################################################
# part d #
###################################################
summary(fit)
###################################################
# part a #
###################################################
library(foreign) # load the package
df <- read.spss(file = "./Data/BLDPRES.SAV", to.data.frame = TRUE) # import your own datafile
###################################################
# part b #
###################################################
fit.age <- lm(dias ~ age, data = df) # do a linear regression of age on dias
plot(y = df$dias, x = df$age, ylab = "Diastolic blood pressure", xlab = "Age in years", main = "Dias versus age") # plot the scatterplot
abline(fit.age, col = "red") # add the regression line over the plot
###################################################
# part c #
###################################################
# with weight as independent variable
fit.weight <- lm(dias ~ weight, data = df) # do a linear regression of weight on dias
plot(y = df$dias, x = df$weight, ylab = "Diastolic blood pressure", xlab = "Weight", main = "Dias versus weight") # plot the scatterplot
abline(fit.weight, col = "red") # add the regression line over the plot
# with pulse as independent variable
fit.pulse <- lm(dias ~ pulse, data = df) # do a linear regression of pulse on dias
plot(y = df$dias, x = df$pulse, ylab = "Diastolic blood pressure", xlab = "Pulse", main = "Dias versus pulse") # plot the scatterplot
abline(fit.pulse, col = "red") # add the regression line over the plot
###################################################
# part d #
###################################################
summary(fit.age)
summary(fit.weight)
summary(fit.pulse)
##
## Call:
## lm(formula = dias ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.2944 -7.2955 -0.1049 6.4850 27.6855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.99622 1.55712 50.090 <2e-16 ***
## age 0.07997 0.03714 2.153 0.032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 326 degrees of freedom
## Multiple R-squared: 0.01402, Adjusted R-squared: 0.011
## F-statistic: 4.636 on 1 and 326 DF, p-value: 0.03204
##
## Call:
## lm(formula = dias ~ weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.2729 -6.6191 0.2111 5.6101 30.8809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.99104 2.42989 27.570 < 2e-16 ***
## weight 0.27564 0.04622 5.963 6.44e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.69 on 326 degrees of freedom
## Multiple R-squared: 0.09835, Adjusted R-squared: 0.09558
## F-statistic: 35.56 on 1 and 326 DF, p-value: 6.437e-09
##
## Call:
## lm(formula = dias ~ pulse, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.8296 -6.8272 -0.1902 6.1872 28.6704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.04171 3.96098 18.188 <2e-16 ***
## pulse 0.11057 0.04773 2.316 0.0212 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.12 on 326 degrees of freedom
## Multiple R-squared: 0.01619, Adjusted R-squared: 0.01317
## F-statistic: 5.366 on 1 and 326 DF, p-value: 0.02116
We can see that the coefficients of age, weight and pulse are 0.08, 0.28 and 0.11, respectively. The Wald tests show p-values <0.05 for all three coefficients. Each coefficient is interpreted as the change in diastolic blood pressure per one-unit increase in the predictor: for example, one year increase in age is associated with a 0.08 mmHg increase in diastolic blood pressure. The percentage of explained variability in diastolic blood pressure is 1.4%, 9.8% and 1.6% for age, weight and pulse, respectively. This means that, of the three investigated variables, weight explains the most variability in diastolic blood pressure.
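The three R-squared values quoted here can also be collected programmatically; a sketch, assuming the fitted objects from parts b and c:

```r
fits <- list(age = fit.age, weight = fit.weight, pulse = fit.pulse)
sapply(fits, function(f) summary(f)$r.squared) # explained variability per predictor
```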
plot(...) # inserting a data.frame with the columns of interest will plot all pairwise scatterplots
corr <- cor(x = ..., # insert the same data.frame
use = "pairwise.complete.obs", method = "...") # calculate correlation coefficients. Spearman or Pearson?
install.packages("corrplot") # the first time
corrplot::corrplot(corr, method = "square", type ="upper") # display graphically the correlation between variables, from the corrplot package
df.cor <- df[, c("pulse", "weight", "age")]
plot(df.cor)
corr <- cor(x = df.cor, use = "pairwise.complete.obs", method = "pearson")
corrplot::corrplot(corr, method = "square", type = "upper")
We can observe that little correlation is found between the variables. Visual inspection of the scatterplots suggested that the bivariate distributions are approximately normal, so we used the Pearson correlation coefficient to inspect correlation. Since the correlations are low, we can include all three variables in the model.
lm(dias ~ ... + ... + ..., data = df) #For multiple regression, use the + sign between predictors
###################################################
# part a #
###################################################
fit.mult <- lm(dias ~ age + weight + pulse, data = df)
###################################################
# part b #
###################################################
summary(fit.mult)
###################################################
# part c #
###################################################
plot(fit.mult)
##
## Call:
## lm(formula = dias ~ age + weight + pulse, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.1432 -6.1882 -0.0231 5.4479 28.8880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.49560 4.60553 11.833 < 2e-16 ***
## age 0.08023 0.03510 2.286 0.0229 *
## weight 0.28194 0.04556 6.188 1.84e-09 ***
## pulse 0.10996 0.04516 2.435 0.0154 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.544 on 324 degrees of freedom
## Multiple R-squared: 0.1308, Adjusted R-squared: 0.1227
## F-statistic: 16.25 on 3 and 324 DF, p-value: 7.332e-10
All three variables have a significant effect on diastolic blood pressure. The interpretation is as follows: conditional on weight and pulse ("keeping these variables constant"), a one-year increase in age results in a 0.08 mmHg increase in diastolic blood pressure. The interpretation of the other coefficients is similar. The R-squared should be interpreted as the fraction of variance explained by the model, which is 0.13. The adjusted R-squared corrects for the number of degrees of freedom spent in the analysis; since it is only slightly lower than the ordinary R-squared, few degrees of freedom were spent for the explained variance. The highest R-squared in the univariable analyses was that of weight: 0.098. The three variables combined explain more of the variance in diastolic blood pressure than each of them separately. The F-test is the ANOVA test with H0: beta(age)=beta(weight)=beta(pulse)=0. The value of 16.25 on 3 and 324 degrees of freedom results in a p-value <0.001, so the model significantly explains diastolic blood pressure.
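The overall p-value can be reproduced from the F statistic and its degrees of freedom; a sketch:

```r
pf(16.25, df1 = 3, df2 = 324, lower.tail = FALSE) # upper-tail F probability
```

This should match the p-value of 7.33e-10 reported in the summary output.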
The first plot shows the residuals plotted against the fitted values of the model. We look for normality along the y-axis, because we assumed that the residuals are normally distributed for each value of x. We also do not observe heteroscedasticity in the residuals: the variation around the mean is equal across the fitted values. The next plot (Q-Q plot) shows how the standardized residuals compare to what they would be if they were theoretically normally distributed; since the dots follow the diagonal, the residuals are normally distributed for each value of x. The third plot is similar to the first, but with the square root of the standardized residuals plotted against the fitted values; the information it gives is similar to that of the first plot. The fourth plot indicates whether outliers are distorting the model significantly. This does not seem to be the case.
This exercise concerns the data of the 25 cystic fibrosis patients given in the file CYSTFIBR.SAV. We want to predict with multiple regression the residual volume, RV, on the basis of body height and weight.
###################################################
# part a #
###################################################
library(foreign)
read.spss(..., to.data.frame = TRUE)
###################################################
# part b #
###################################################
lm(... ~ ..., data = ...) # to fit a linear model
summary(...) # to obtain summary statistics
###################################################
# part a #
###################################################
library(foreign) # load the package
df <- read.spss(file = "./data/CYSTFIBR.SAV", to.data.frame = TRUE) # import your own datafile
###################################################
# part b #
###################################################
fit.height <- lm(RV ~ HEIGHT, data = df)
summary(fit.height)
fit.weight <- lm(RV ~ WEIGHT, data = df)
summary(fit.weight)
##
## Call:
## lm(formula = RV ~ HEIGHT, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.99 -41.91 -14.34 29.50 162.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 603.3592 105.7678 5.705 8.27e-06 ***
## HEIGHT -2.2785 0.6857 -3.323 0.00296 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 72.22 on 23 degrees of freedom
## Multiple R-squared: 0.3244, Adjusted R-squared: 0.295
## F-statistic: 11.04 on 1 and 23 DF, p-value: 0.002962
##
## Call:
## lm(formula = RV ~ WEIGHT, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.691 -58.980 -2.396 39.321 137.178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 369.9091 33.1439 11.161 9.26e-11 ***
## WEIGHT -2.9869 0.7851 -3.805 0.000913 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.84 on 23 degrees of freedom
## Multiple R-squared: 0.3863, Adjusted R-squared: 0.3596
## F-statistic: 14.48 on 1 and 23 DF, p-value: 0.0009126
We can see that the coefficients of height and weight are both significant (p<0.05). Both regressions have a significant ANOVA test, so each model significantly explains variation in RV (H0: beta(…)=0).
###################################################
# part a #
###################################################
lm(... ~ ... + ...) # this function should look familiar
###################################################
# part b #
###################################################
case <- c(..., ...) # specify a vector, with the height and weight
itcp + betas %*% case # the RV is equal to the intercept plus the
# matrix multiplication of the betas and the
# case (%*% operator)
###################################################
# part c #
###################################################
fit$residuals # to extract the residuals
plot(y = ..., x = ...) # plot these against...?
###################################################
# part a #
###################################################
fit <- lm(RV ~ WEIGHT + HEIGHT, data = df)
summary(fit)
###################################################
# part b #
###################################################
betas <- coef(fit)[2:3]
itcpt <- coef(fit)[1]
case <- c(WEIGHT = 45, HEIGHT = 150)
itcpt + betas%*%case
###################################################
# part c #
###################################################
res <- fit$residuals
plot(y = res, x = df$HEIGHT)
plot(y = res, x = df$WEIGHT)
##
## Call:
## lm(formula = RV ~ WEIGHT + HEIGHT, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.19 -58.34 -2.34 38.61 136.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 362.09272 191.90423 1.887 0.0724 .
## WEIGHT -3.06526 2.05664 -1.490 0.1503
## HEIGHT 0.07085 1.71209 0.041 0.9674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.38 on 22 degrees of freedom
## Multiple R-squared: 0.3863, Adjusted R-squared: 0.3305
## F-statistic: 6.925 on 2 and 22 DF, p-value: 0.004649
## [,1]
## [1,] 234.7831
#### Discussion
We can see that both Wald tests of the predictors are non-significant (p>0.05). This does not make the model non-significant: an F of 6.925 on 2 and 22 degrees of freedom gives p=0.005, so the model is significant. The reason the Wald tests are non-significant is that the predictors are highly collinear (high correlation between height and weight). The case presented should have an RV of 234.8. Furthermore, the plots show no trend between the residuals and WEIGHT, but there is a trend between the residuals and HEIGHT.
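The manual matrix multiplication in part b can be cross-checked with predict(); a sketch, assuming fit from part a:

```r
predict(fit, newdata = data.frame(WEIGHT = 45, HEIGHT = 150)) # fitted RV for the case, ~234.8
```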
###################################################
# part a #
###################################################
lm(... ~ ... + ... + I(... ^ 2)) # this is how to put a squared version in the formula
###################################################
# part b #
###################################################
anova(..., ..., test = "LRT") # which models should be compared...?
###################################################
# part a #
###################################################
fit2 <- lm(RV ~ WEIGHT + HEIGHT + I(HEIGHT ^ 2), data = df)
summary(fit2)
###################################################
# part b #
###################################################
anova(fit2, fit.weight, test = "LRT")
##
## Call:
## lm(formula = RV ~ WEIGHT + HEIGHT + I(HEIGHT^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.439 -53.586 -6.107 39.339 138.612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 669.59734 756.08388 0.886 0.386
## WEIGHT -3.61590 2.47071 -1.464 0.158
## HEIGHT -4.38336 10.72203 -0.409 0.687
## I(HEIGHT^2) 0.01657 0.03936 0.421 0.678
##
## Residual standard error: 71.73 on 21 degrees of freedom
## Multiple R-squared: 0.3915, Adjusted R-squared: 0.3045
## F-statistic: 4.503 on 3 and 21 DF, p-value: 0.01369
## Analysis of Variance Table
##
## Model 1: RV ~ WEIGHT + HEIGHT + I(HEIGHT^2)
## Model 2: RV ~ WEIGHT
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 21 108062
## 2 23 108983 -2 -920.7 0.9144
The answer to (a) is that the squared version of height does not contribute significantly to the model: the Wald test gives p=0.678. When testing the joint significance of height and height squared, the likelihood ratio test gives p=0.914. The H0 of this test is that the model without height and height squared is equally predictive of RV as the model with them. This hypothesis cannot be rejected.
This exercise again concerns the data of the 25 cystic fibrosis patients given in the file CYSTFIBR.SAV. The problem is about the difference in mean FEV1 between male and female patients, adjusted for body weight by covariance analysis.
###################################################
# part a #
###################################################
install.packages("foreign") # install package first time
library(foreign)
read.spss(..., to.data.frame = TRUE)
###################################################
# part b #
###################################################
fit.crude <- lm(...)
summary(...)
confint(...) # to obtain the confidence intervals of the coefficients
###################################################
# part c #
###################################################
t.test(formula = ..., data = df) # what formula to put in...?
###################################################
# part a #
###################################################
install.packages("foreign")
library(foreign)
df <- read.spss("./Data/CYSTFIBR.SAV", to.data.frame = TRUE)
###################################################
# part b #
###################################################
fit.crude <- lm(FEV1 ~ SEX, data = df)
summary(fit.crude)
confint(fit.crude)
###################################################
# part c #
###################################################
t.test(FEV1 ~ SEX, var.equal = TRUE, data = df)
##
## Call:
## lm(formula = FEV1 ~ SEX, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8571 -6.8571 -0.8571 5.1429 17.1429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.857 2.596 15.356 1.4e-13 ***
## SEXfemale -11.675 3.913 -2.984 0.00664 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.712 on 23 degrees of freedom
## Multiple R-squared: 0.2791, Adjusted R-squared: 0.2477
## F-statistic: 8.903 on 1 and 23 DF, p-value: 0.006639
## 2.5 % 97.5 %
## (Intercept) 34.48775 45.226540
## SEXfemale -19.77000 -3.580653
##
## Two Sample t-test
##
## data: FEV1 by SEX
## t = 2.9837, df = 23, p-value = 0.006639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.580653 19.769996
## sample estimates:
## mean in group male mean in group female
## 39.85714 28.18182
We can observe that there are two betas in the linear model: the intercept and the effect of SEX on FEV1. The second coefficient is the one of interest. It can be interpreted as: if SEX=female, FEV1 is 11.675 lower than if SEX=male. This difference is significant, with p=0.0066. The 95% confidence interval of the difference is -19.77 to -3.58; it excludes zero, because the difference is significant. If we check this with a t-test, we see that the p-value and confidence interval are equal to the ones obtained in the regression. Note that by default t.test() assumes unequal variances of FEV1 within the groups of SEX, whereas the regression assumes equal variances; we therefore obtain a slightly different answer if we do not specify var.equal = TRUE.
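To see the effect of the equal-variance assumption mentioned above, the default Welch version can be run alongside; a sketch:

```r
t.test(FEV1 ~ SEX, data = df) # Welch t-test (default): slightly different df, p and CI
```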
###################################################
# part b #
###################################################
lm(... ~ ... + ..., data = df)
###################################################
# part a #
###################################################
t.test(WEIGHT ~ SEX, data = df)
###################################################
# part b #
###################################################
fit.adj <- lm(FEV1 ~ SEX + WEIGHT, data = df)
summary(fit.adj)
confint(fit.adj)
##
## Welch Two Sample t-test
##
## data: WEIGHT by SEX
## t = 0.9771, df = 22.451, p-value = 0.3389
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.535395 20.991239
## sample estimates:
## mean in group male mean in group female
## 41.36429 34.63636
##
## Call:
## lm(formula = FEV1 ~ SEX + WEIGHT, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.181 -5.849 -1.468 5.330 16.986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.5064 4.9609 6.149 3.44e-06 ***
## SEXfemale -10.1544 3.7028 -2.742 0.0119 *
## WEIGHT 0.2261 0.1048 2.157 0.0422 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.022 on 22 degrees of freedom
## Multiple R-squared: 0.4049, Adjusted R-squared: 0.3508
## F-statistic: 7.484 on 2 and 22 DF, p-value: 0.003316
## 2.5 % 97.5 %
## (Intercept) 20.218181138 40.7946241
## SEXfemale -17.833608041 -2.4752360
## WEIGHT 0.008691923 0.4434247
It is observed that adding weight to the regression model results in a significant effect of WEIGHT on FEV1. The estimated difference between males and females in FEV1, adjusted for weight, is 10.15 (95% CI: 2.48 to 17.83). This is slightly different from the unadjusted difference (11.675, 95% CI: 3.58 to 19.77).
###################################################
# part a #
###################################################
fit.fem <- lm(... ~ ..., data = df[df$SEX == "female", ])
fit.men <- lm(... ~ ..., data = df[df$SEX == "male", ])
plot(...)
abline(...)
###################################################
# part b #
###################################################
lm(... ~ ... * ...) # the "*" includes both main effects (as "+") and adds their interaction
###################################################
# part a #
###################################################
fit.fem <- lm(FEV1 ~ WEIGHT, data = df[df$SEX == "female", ])
fit.men <- lm(FEV1 ~ WEIGHT, data = df[df$SEX == "male", ])
plot(df$FEV1 ~ df$WEIGHT)
abline(fit.fem, col = "red")
abline(fit.men, col = "blue")
###################################################
# part b #
###################################################
fit.int <- lm(FEV1 ~ WEIGHT * SEX, data = df)
summary(fit.int)
##
## Call:
## lm(formula = FEV1 ~ WEIGHT * SEX, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8606 -6.1095 -0.9119 4.7378 17.0240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.69290 5.67831 5.229 3.5e-05 ***
## WEIGHT 0.24573 0.12370 1.986 0.0602 .
## SEXfemale -7.31330 9.72627 -0.752 0.4604
## WEIGHT:SEXfemale -0.07821 0.24668 -0.317 0.7543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.212 on 21 degrees of freedom
## Multiple R-squared: 0.4077, Adjusted R-squared: 0.3231
## F-statistic: 4.819 on 3 and 21 DF, p-value: 0.01047
We can see that the lines in the plot diverge slightly; however, within this sample (between 20 and 70 kg) they do not cross. The interaction term in the model (WEIGHT:SEXfemale) is not significant (p=0.754), so we conclude that the lines run parallel.
This exercise concerns the data set BLDPRES.SAV. We want to investigate whether there are differences in mean systolic blood pressure between the three social-economic status groups, adjusted for differences in body weight.
###################################################
# part a #
###################################################
install.packages("foreign") #install package first time
library(foreign)
read.spss(..., to.data.frame = TRUE)
###################################################
# part b #
###################################################
class(...) #to check the class
factor(...) #to recode the variable in a factor
contrasts(...) #to obtain the coding for the regression
###################################################
# part d #
###################################################
summary(aov(...)) #to run a one-way anova
###################################################
# part a #
###################################################
install.packages("foreign") #install package first time
library(foreign)
df <- read.spss("./Data/BLDPRES.SAV", to.data.frame = TRUE)
###################################################
# part b #
###################################################
class(df$ses)
contrasts(df$ses)
###################################################
# part c #
###################################################
fit.crude <- lm(sys ~ ses, data = df)
summary(fit.crude)
confint(fit.crude)
###################################################
# part d #
###################################################
summary(aov(df$sys ~ df$ses))
## [1] "factor"
## middle class upperclass
## lower class 0 0
## middle class 1 0
## upperclass 0 1
##
## Call:
## lm(formula = sys ~ ses, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.736 -11.685 -2.736 7.264 62.778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.185 1.453 85.470 < 2e-16 ***
## ses middle class -1.449 2.099 -0.690 0.49058
## sesupperclass 8.037 2.595 3.097 0.00213 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.07 on 325 degrees of freedom
## Multiple R-squared: 0.04072, Adjusted R-squared: 0.03482
## F-statistic: 6.898 on 2 and 325 DF, p-value: 0.001164
## 2.5 % 97.5 %
## (Intercept) 121.326369 127.04320
## ses middle class -5.577575 2.68045
## sesupperclass 2.931768 13.14311
## Df Sum Sq Mean Sq F value Pr(>F)
## df$ses 2 4019 2009.7 6.898 0.00116 **
## Residuals 325 94684 291.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis of variance (ANOVA) test shows a p-value of 0.00116. This is exactly the same p-value as for the linear model (note: compared to the overall F-test, not to the individual coefficients). We see a significant difference between the three ses groups. From the regression, we can observe that the upperclass has a higher systolic blood pressure than the other two groups, which do not differ significantly from each other.
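Pairwise comparisons between the ses groups, with a multiplicity correction, can be obtained from the ANOVA fit; a sketch (not part of the exercise):

```r
TukeyHSD(aov(sys ~ ses, data = df)) # Tukey-adjusted pairwise differences between ses groups
```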
###################################################
# part c #
###################################################
lm(... ~ ... * ...)
###################################################
# part a #
###################################################
fit.adj <- lm(sys ~ ses + weight, data = df)
###################################################
# part b #
###################################################
fit.weight <- lm(sys ~ weight, data = df)
anova(fit.adj, fit.weight)
###################################################
# part C #
###################################################
fit.int <- lm(sys ~ ses * weight, data = df)
summary(fit.int)
##
## Call:
## lm(formula = sys ~ ses + weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.005 -10.230 -3.373 6.680 61.855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.21601 4.35871 25.745 < 2e-16 ***
## ses middle class -3.18392 2.15923 -1.475 0.14130
## sesupperclass 4.58305 2.82755 1.621 0.10602
## weight 0.25945 0.08921 2.908 0.00388 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.88 on 324 degrees of freedom
## Multiple R-squared: 0.06513, Adjusted R-squared: 0.05647
## F-statistic: 7.524 on 3 and 324 DF, p-value: 6.995e-05
## Analysis of Variance Table
##
## Model 1: sys ~ ses + weight
## Model 2: sys ~ weight
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 324 92275
## 2 326 94776 -2 -2501.5 0.01238 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = sys ~ ses * weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.839 -10.110 -3.473 6.841 62.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.70175 8.00943 14.196 <2e-16 ***
## ses middle class -7.78273 10.71896 -0.726 0.468
## sesupperclass 7.18875 13.23063 0.543 0.587
## weight 0.22725 0.17080 1.331 0.184
## ses middle class:weight 0.09115 0.21576 0.422 0.673
## sesupperclass:weight -0.03662 0.24345 -0.150 0.881
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.92 on 322 degrees of freedom
## Multiple R-squared: 0.06627, Adjusted R-squared: 0.05177
## F-statistic: 4.571 on 5 and 322 DF, p-value: 0.0004837
When correcting for weight, the ses classes no longer differ significantly from each other in the individual coefficient tests (p>0.05). However, when comparing the model with ses to the model without ses using a likelihood ratio test, we find p=0.012. In conclusion, ses still predicts sys in the multivariable model, even though the pairwise group differences are not individually significant. Furthermore, it is safe to assume that the difference in systolic blood pressure does not depend on weight, because the interaction terms between ses and weight (ses middle class:weight and sesupperclass:weight) are not significant (p=0.67 and p=0.88, respectively).
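The two interaction terms can also be tested jointly with a single model comparison (a sketch, using the fitted objects fit.adj and fit.int from above):

```r
# Joint test of both ses:weight interaction terms at once:
# compare the additive model against the interaction model
anova(fit.adj, fit.int)
```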
Now (again for the BLDPRES-data) we want to investigate whether there are differences in mean diastolic blood pressure between the three social economic status groups, adjusted for differences in pulse rate.
###################################################
# part c #
###################################################
# fit a new model with interaction between SES class and pulse
# then plot the regression lines for the different SES classes
ggplot(df, aes(y = ..., x = ..., color = ...)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
###################################################
# part d #
###################################################
# check the model summary again to analyze the regression equation
summary(...)
###################################################
# part a #
###################################################
library(foreign)
df <- read.spss("./data/BLDPRES.sav", to.data.frame = TRUE)
class(df$ses)
contrasts(df$ses)
###################################################
# part b #
###################################################
fit1 <- lm(dias ~ ses + pulse, data = df)
summary(fit1) #coefficients
# beta mid SES: 0.033, P = 0.979 (reference category: low SES)
# beta high SES: 3.322, P = 0.031 (reference category: low SES)
# beta pulse: 0.107, P = 0.026
# note that a model including dummies for SES class will give the same results
levels(df$ses) # please note: there is a space before " middle class"
mid.ses <- ifelse(df$ses == " middle class", 1, 0)
high.ses <- ifelse(df$ses == "upperclass", 1, 0)
fit2 <- lm(dias ~ mid.ses + high.ses + pulse, data = df)
summary(fit2)
###################################################
# part c #
###################################################
fit3 <- lm(dias ~ pulse * ses, data = df)
summary(fit3)
library(ggplot2)
ggplot(df, aes(y = dias, x = pulse, color = ses)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
###################################################
# part d #
###################################################
summary(fit3)
# regression equation:
# dias = 74.56 + 0.07 * pulse + 6.68 * ses middle class - 30.47 * sesupperclass - 0.08 * pulse * ses middle class + 0.41 * pulse * sesupperclass
# predictions for low SES: dias = 74.56 + 0.07 * pulse
# predictions for middle SES: dias = 74.56 + 0.07 * pulse + 6.68 - 0.08 * pulse
# ---------------------------------------------------
# difference: 6.68 - 0.08 * pulse
# predictions for low SES: dias = 74.56 + 0.07 * pulse
# predictions for high SES: dias = 74.56 + 0.07 * pulse - 30.47 + 0.41 * pulse
# ---------------------------------------------------
# difference: - 30.47 + 0.41*pulse
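These group-specific equations can be verified with predict() (an illustrative sketch, using fit3 from above; the pulse grid is chosen arbitrarily):

```r
# Predicted diastolic blood pressure per SES class over a pulse grid;
# the between-class differences change with pulse because of the interaction
new.data <- expand.grid(pulse = c(60, 70, 80, 90),
                        ses = levels(df$ses))
new.data$pred <- predict(fit3, newdata = new.data)
new.data
```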
## re-encoding from CP1252
## [1] "factor"
## middle class upperclass
## lower class 0 0
## middle class 1 0
## upperclass 0 1
##
## Call:
## lm(formula = dias ~ ses + pulse, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.1714 -6.8993 -0.2905 6.0025 29.3286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.71777 4.00409 17.911 <2e-16 ***
## ses middle class 0.03281 1.23855 0.026 0.9789
## sesupperclass 3.32237 1.53131 2.170 0.0308 *
## pulse 0.10659 0.04753 2.243 0.0256 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 324 degrees of freedom
## Multiple R-squared: 0.03257, Adjusted R-squared: 0.02361
## F-statistic: 3.635 on 3 and 324 DF, p-value: 0.0132
##
## Call:
## lm(formula = dias ~ pulse * ses, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.1123 -7.4857 -0.4016 5.3243 29.3877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.56009 6.47520 11.515 < 2e-16 ***
## pulse 0.07205 0.07802 0.923 0.35645
## ses middle class 6.68121 8.59382 0.777 0.43747
## sesupperclass -30.47143 11.29895 -2.697 0.00737 **
## pulse:ses middle class -0.08181 0.10378 -0.788 0.43107
## pulse:sesupperclass 0.40738 0.13529 3.011 0.00281 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.88 on 322 degrees of freedom
## Multiple R-squared: 0.07416, Adjusted R-squared: 0.05978
## F-statistic: 5.158 on 5 and 322 DF, p-value: 0.000144
In the first model (without interaction), high SES and pulse had a significant effect on diastolic blood pressure, but middle SES did not. In the second model, the interaction term between high SES and pulse was significant, so the effect of high SES differs across pulse rates.
aov(... ~ ..., data = ...)
# aov() is automatically fitting the variables sequentially (i.e. adding it to the model in the order specified)
# instead, we want to fit the variables marginally (i.e. treating it as the last variable added to the model)
# use the drop1 function to correct this
drop1(..., ~., test = "F")
# also take a look at the diagnostic plots
layout(matrix(c(1, 2, 3, 4), 2, 2)) # create a 2x2 layout for the plots
plot(...)
fit4 <- aov(dias ~ pulse * ses, data = df)
drop1(fit4, ~., test = "F")
coefficients(fit4)
# similar results as the previous fit3 model
# also take a look at the diagnostic plots
layout(matrix(c(1, 2, 3, 4), 2, 2)) # create a 2x2 layout for the plots
plot(fit4)
## Single term deletions
##
## Model:
## dias ~ pulse * ses
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 31433 1508.5
## pulse 1 83.25 31516 1507.4 0.8528 0.3564499
## ses 2 1162.47 32596 1516.4 5.9542 0.0028894 **
## pulse:ses 2 1412.16 32845 1519.0 7.2331 0.0008457 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Intercept) pulse ses middle class
## 74.56008611 0.07205073 6.68121194
## sesupperclass pulse:ses middle class pulse:sesupperclass
## -30.47143064 -0.08181469 0.40737847
The results from the ANOVA model (with marginal variable testing via drop1) are the same as those from the linear regression model. The residual plot shows no evident non-linearity, the scale-location plot shows no clear signs of unequal variance, the normal Q-Q plot shows no violation of the normality assumption, and the residuals vs leverage plot shows no influential outliers.
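The difference between sequential and marginal testing can be made explicit by comparing anova() and drop1() on the same fit (a sketch, using fit4 from above):

```r
# Sequential (Type I) tests: each term is tested in the order it
# enters the model formula
anova(fit4)
# Marginal tests: each term is tested as if it were added last
drop1(fit4, ~ ., test = "F")
```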